We’ll be use Rstudio and R for this tutorial.
In particular, we’ll be using the following packages:
The following code just installs any packages that are not already installed on your system.
pkgs = c(
"ggplot2",
"GEOquery",
"SummarizedExperiment",
"ComplexHeatmap",
"GGally"
)
ins = installed.packages(repos = BiocManager::repositories())
for(pkg in pkgs) {
if(!(pkg %in% rownames(ins)))
BiocManager::install(pkg)
}
Recall that NCBI GEO is a large database that contains publicly available and free gene expression data. We’ll assume here that you have already explored the literature and the GEO data available and have settled on the dataset entitled, “Gene expression profiles of breast, colorectal, prostate, and non-small cell lung cancer.” The dataset details are available here:
The associated publication, “Regulatory T-cell Genes Drive Altered Immune Microenvironment in Adult Solid Cancers and Allow for Immune Contextual Patient Subtyping,” is available here:
Once we have identified a dataset of interest, we can access it using GEOquery.
library(GEOquery)
# The [[1]] in the following line it because
# some GEO records include more than one dataset.
# In this cae, we just want the first (and only)
# dataset.
gse = getGEO("GSE103512")[[1]]
GEOquery was written many years ago. Since that time, the
Bioconductor data structures for storing gene expression data have been
improved and updated. The following code chunk just updates the gene
expression data to use the new SummarizedExperiment
container.
library(SummarizedExperiment)
se = as(gse, "SummarizedExperiment")
We can get a quick look at what is included by simply printing the
se variable:
print(se)
## class: SummarizedExperiment
## dim: 54715 280
## metadata(3): experimentData annotation protocolData
## assays(1): exprs
## rownames(54715): 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-5_at
## AFFX-TrpnX-M_at
## rowData names(16): ID GB_ACC ... Gene.Ontology.Cellular.Component
## Gene.Ontology.Molecular.Function
## colnames(280): GSM2772660 GSM2772661 ... GSM2772938 GSM2772939
## colData names(72): title geo_accession ... tumorsize.ch1 weight.ch1
Questions
From the output of the print(se) statement:
Hint: Look at the dimensions listed in the output.
Remember that the data in Bioconductor are now in the form of a
SummarizedExperiment. You can read more about
SummarizedExperiments here.
An overview of the data structure is portrayed in the following figure.
In particular, we might want to look at the information about the
samples. To do so, we can use the colData method.
colData(se)
## DataFrame with 280 rows and 72 columns
## title geo_accession status
## <character> <character> <character>
## GSM2772660 BC_coex_path_001 GSM2772660 Public on Sep 07 2017
## GSM2772661 BC_coex_path_002 GSM2772661 Public on Sep 07 2017
## GSM2772662 BC_coex_path_003 GSM2772662 Public on Sep 07 2017
## GSM2772663 BC_coex_path_004 GSM2772663 Public on Sep 07 2017
## GSM2772664 BC_coex_path_005 GSM2772664 Public on Sep 07 2017
## ... ... ... ...
## GSM2772935 PC_coex_path_002_nor.. GSM2772935 Public on Sep 07 2017
## GSM2772936 PC_coex_path_003_nor.. GSM2772936 Public on Sep 07 2017
## GSM2772937 PC_coex_path_009_nor.. GSM2772937 Public on Sep 07 2017
## GSM2772938 PC_coex_path_004_nor.. GSM2772938 Public on Sep 07 2017
## GSM2772939 PC_coex_path_010_nor.. GSM2772939 Public on Sep 07 2017
## submission_date last_update_date type channel_count
## <character> <character> <character> <character>
## GSM2772660 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772661 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772662 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772663 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772664 Sep 05 2017 Jan 23 2018 RNA 1
## ... ... ... ... ...
## GSM2772935 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772936 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772937 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772938 Sep 05 2017 Jan 23 2018 RNA 1
## GSM2772939 Sep 05 2017 Jan 23 2018 RNA 1
## source_name_ch1 organism_ch1 characteristics_ch1
## <character> <character> <character>
## GSM2772660 Total RNA Seq profil.. Homo sapiens cancer type: BC
## GSM2772661 Total RNA Seq profil.. Homo sapiens cancer type: BC
## GSM2772662 Total RNA Seq profil.. Homo sapiens cancer type: BC
## GSM2772663 Total RNA Seq profil.. Homo sapiens cancer type: BC
## GSM2772664 Total RNA Seq profil.. Homo sapiens cancer type: BC
## ... ... ... ...
## GSM2772935 Total RNA Seq profil.. Homo sapiens cancer type: PCA
## GSM2772936 Total RNA Seq profil.. Homo sapiens cancer type: PCA
## GSM2772937 Total RNA Seq profil.. Homo sapiens cancer type: PCA
## GSM2772938 Total RNA Seq profil.. Homo sapiens cancer type: PCA
## GSM2772939 Total RNA Seq profil.. Homo sapiens cancer type: PCA
## characteristics_ch1.1 characteristics_ch1.2 characteristics_ch1.3
## <character> <character> <character>
## GSM2772660 normal: no batch i/ii: I histology: na
## GSM2772661 normal: no batch i/ii: I histology: na
## GSM2772662 normal: no batch i/ii: I histology: na
## GSM2772663 normal: no batch i/ii: I histology: na
## GSM2772664 normal: no batch i/ii: I histology: na
## ... ... ... ...
## GSM2772935 normal: yes batch i/ii: II histology: na
## GSM2772936 normal: yes batch i/ii: II histology: na
## GSM2772937 normal: yes batch i/ii: II histology: na
## GSM2772938 normal: yes batch i/ii: II histology: na
## GSM2772939 normal: yes batch i/ii: II histology: na
## characteristics_ch1.4 characteristics_ch1.5 characteristics_ch1.6
## <character> <character> <character>
## GSM2772660 age: na gender: na weight: na
## GSM2772661 age: na gender: na weight: na
## GSM2772662 age: na gender: na weight: na
## GSM2772663 age: na gender: na weight: na
## GSM2772664 age: na gender: na weight: na
## ... ... ... ...
## GSM2772935 age: na gender: na weight: na
## GSM2772936 age: na gender: na weight: na
## GSM2772937 age: na gender: na weight: na
## GSM2772938 age: na gender: na weight: na
## GSM2772939 age: na gender: na weight: na
## characteristics_ch1.7 characteristics_ch1.8 characteristics_ch1.9
## <character> <character> <character>
## GSM2772660 bodyheight: na bmi: na diseasecode: na
## GSM2772661 bodyheight: na bmi: na diseasecode: na
## GSM2772662 bodyheight: na bmi: na diseasecode: na
## GSM2772663 bodyheight: na bmi: na diseasecode: na
## GSM2772664 bodyheight: na bmi: na diseasecode: na
## ... ... ... ...
## GSM2772935 bodyheight: na bmi: na diseasecode: na
## GSM2772936 bodyheight: na bmi: na diseasecode: na
## GSM2772937 bodyheight: na bmi: na diseasecode: na
## GSM2772938 bodyheight: na bmi: na diseasecode: na
## GSM2772939 bodyheight: na bmi: na diseasecode: na
## characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
## <character> <character> <character>
## GSM2772660 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772661 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772662 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772663 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772664 diseasedescription: na primary / relapse tu.. tumorsize: na
## ... ... ... ...
## GSM2772935 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772936 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772937 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772938 diseasedescription: na primary / relapse tu.. tumorsize: na
## GSM2772939 diseasedescription: na primary / relapse tu.. tumorsize: na
## characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
## <character> <character> <character>
## GSM2772660 ischemiatime: na tumorcontent: na organ: na
## GSM2772661 ischemiatime: na tumorcontent: na organ: na
## GSM2772662 ischemiatime: na tumorcontent: na organ: na
## GSM2772663 ischemiatime: na tumorcontent: na organ: na
## GSM2772664 ischemiatime: na tumorcontent: na organ: na
## ... ... ... ...
## GSM2772935 ischemiatime: na tumorcontent: na organ: na
## GSM2772936 ischemiatime: na tumorcontent: na organ: na
## GSM2772937 ischemiatime: na tumorcontent: na organ: na
## GSM2772938 ischemiatime: na tumorcontent: na organ: na
## GSM2772939 ischemiatime: na tumorcontent: na organ: na
## characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
## <character> <character> <character>
## GSM2772660 tumorlocalization: na Stage: na radicality: na
## GSM2772661 tumorlocalization: na Stage: na radicality: na
## GSM2772662 tumorlocalization: na Stage: na radicality: na
## GSM2772663 tumorlocalization: na Stage: na radicality: na
## GSM2772664 tumorlocalization: na Stage: na radicality: na
## ... ... ... ...
## GSM2772935 tumorlocalization: na Stage: na radicality: na
## GSM2772936 tumorlocalization: na Stage: na radicality: na
## GSM2772937 tumorlocalization: na Stage: na radicality: na
## GSM2772938 tumorlocalization: na Stage: na radicality: na
## GSM2772939 tumorlocalization: na Stage: na radicality: na
## characteristics_ch1.19 characteristics_ch1.20 treatment_protocol_ch1
## <character> <character> <character>
## GSM2772660 grading: na dignity: na Tumor samples were f..
## GSM2772661 grading: na dignity: na Tumor samples were f..
## GSM2772662 grading: na dignity: na Tumor samples were f..
## GSM2772663 grading: na dignity: na Tumor samples were f..
## GSM2772664 grading: na dignity: na Tumor samples were f..
## ... ... ... ...
## GSM2772935 grading: na dignity: na Tumor samples were f..
## GSM2772936 grading: na dignity: na Tumor samples were f..
## GSM2772937 grading: na dignity: na Tumor samples were f..
## GSM2772938 grading: na dignity: na Tumor samples were f..
## GSM2772939 grading: na dignity: na Tumor samples were f..
## growth_protocol_ch1 molecule_ch1 extract_protocol_ch1
## <character> <character> <character>
## GSM2772660 Samples were extract.. total RNA Extraction was done ..
## GSM2772661 Samples were extract.. total RNA Extraction was done ..
## GSM2772662 Samples were extract.. total RNA Extraction was done ..
## GSM2772663 Samples were extract.. total RNA Extraction was done ..
## GSM2772664 Samples were extract.. total RNA Extraction was done ..
## ... ... ... ...
## GSM2772935 Samples were extract.. total RNA Extraction was done ..
## GSM2772936 Samples were extract.. total RNA Extraction was done ..
## GSM2772937 Samples were extract.. total RNA Extraction was done ..
## GSM2772938 Samples were extract.. total RNA Extraction was done ..
## GSM2772939 Samples were extract.. total RNA Extraction was done ..
## label_ch1 label_protocol_ch1 taxid_ch1
## <character> <character> <character>
## GSM2772660 biotin Biotin-Labeling and .. 9606
## GSM2772661 biotin Biotin-Labeling and .. 9606
## GSM2772662 biotin Biotin-Labeling and .. 9606
## GSM2772663 biotin Biotin-Labeling and .. 9606
## GSM2772664 biotin Biotin-Labeling and .. 9606
## ... ... ... ...
## GSM2772935 biotin Biotin-Labeling and .. 9606
## GSM2772936 biotin Biotin-Labeling and .. 9606
## GSM2772937 biotin Biotin-Labeling and .. 9606
## GSM2772938 biotin Biotin-Labeling and .. 9606
## GSM2772939 biotin Biotin-Labeling and .. 9606
## hyb_protocol scan_protocol data_processing
## <character> <character> <character>
## GSM2772660 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772661 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772662 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772663 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772664 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## ... ... ... ...
## GSM2772935 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772936 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772937 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772938 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## GSM2772939 3 ug (15 ul) labeled.. Scanning was done us.. The CEL files were n..
## platform_id contact_name contact_department
## <character> <character> <character>
## GSM2772660 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772661 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772662 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772663 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772664 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## ... ... ... ...
## GSM2772935 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772936 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772937 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772938 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## GSM2772939 GPL13158 Wei-Yi,,Cheng Pharma Research and ..
## contact_institute contact_address contact_city
## <character> <character> <character>
## GSM2772660 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772661 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772662 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772663 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772664 Roche Innovation Cen.. Roche Translational .. New York
## ... ... ... ...
## GSM2772935 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772936 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772937 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772938 Roche Innovation Cen.. Roche Translational .. New York
## GSM2772939 Roche Innovation Cen.. Roche Translational .. New York
## contact_state contact_zip.postal_code contact_country
## <character> <character> <character>
## GSM2772660 New York 10016 USA
## GSM2772661 New York 10016 USA
## GSM2772662 New York 10016 USA
## GSM2772663 New York 10016 USA
## GSM2772664 New York 10016 USA
## ... ... ... ...
## GSM2772935 New York 10016 USA
## GSM2772936 New York 10016 USA
## GSM2772937 New York 10016 USA
## GSM2772938 New York 10016 USA
## GSM2772939 New York 10016 USA
## supplementary_file data_row_count age.ch1 batch.i.ii.ch1
## <character> <character> <character> <character>
## GSM2772660 ftp://ftp.ncbi.nlm.n.. 54715 na I
## GSM2772661 ftp://ftp.ncbi.nlm.n.. 54715 na I
## GSM2772662 ftp://ftp.ncbi.nlm.n.. 54715 na I
## GSM2772663 ftp://ftp.ncbi.nlm.n.. 54715 na I
## GSM2772664 ftp://ftp.ncbi.nlm.n.. 54715 na I
## ... ... ... ... ...
## GSM2772935 ftp://ftp.ncbi.nlm.n.. 54715 na II
## GSM2772936 ftp://ftp.ncbi.nlm.n.. 54715 na II
## GSM2772937 ftp://ftp.ncbi.nlm.n.. 54715 na II
## GSM2772938 ftp://ftp.ncbi.nlm.n.. 54715 na II
## GSM2772939 ftp://ftp.ncbi.nlm.n.. 54715 na II
## bmi.ch1 bodyheight.ch1 cancer.type.ch1 dignity.ch1
## <character> <character> <character> <character>
## GSM2772660 na na BC na
## GSM2772661 na na BC na
## GSM2772662 na na BC na
## GSM2772663 na na BC na
## GSM2772664 na na BC na
## ... ... ... ... ...
## GSM2772935 na na PCA na
## GSM2772936 na na PCA na
## GSM2772937 na na PCA na
## GSM2772938 na na PCA na
## GSM2772939 na na PCA na
## diseasecode.ch1 diseasedescription.ch1 gender.ch1 grading.ch1
## <character> <character> <character> <character>
## GSM2772660 na na na na
## GSM2772661 na na na na
## GSM2772662 na na na na
## GSM2772663 na na na na
## GSM2772664 na na na na
## ... ... ... ... ...
## GSM2772935 na na na na
## GSM2772936 na na na na
## GSM2772937 na na na na
## GSM2772938 na na na na
## GSM2772939 na na na na
## histology.ch1 ischemiatime.ch1 normal.ch1 organ.ch1
## <character> <character> <character> <character>
## GSM2772660 na na no na
## GSM2772661 na na no na
## GSM2772662 na na no na
## GSM2772663 na na no na
## GSM2772664 na na no na
## ... ... ... ... ...
## GSM2772935 na na yes na
## GSM2772936 na na yes na
## GSM2772937 na na yes na
## GSM2772938 na na yes na
## GSM2772939 na na yes na
## primary...relapse.tumor.ch1 radicality.ch1 Stage.ch1
## <character> <character> <character>
## GSM2772660 na na na
## GSM2772661 na na na
## GSM2772662 na na na
## GSM2772663 na na na
## GSM2772664 na na na
## ... ... ... ...
## GSM2772935 na na na
## GSM2772936 na na na
## GSM2772937 na na na
## GSM2772938 na na na
## GSM2772939 na na na
## tumorcontent.ch1 tumorlocalization.ch1 tumorsize.ch1 weight.ch1
## <character> <character> <character> <character>
## GSM2772660 na na na na
## GSM2772661 na na na na
## GSM2772662 na na na na
## GSM2772663 na na na na
## GSM2772664 na na na na
## ... ... ... ... ...
## GSM2772935 na na na na
## GSM2772936 na na na na
## GSM2772937 na na na na
## GSM2772938 na na na na
## GSM2772939 na na na na
That is a lot of output to look through. In Rstudio, we can use the interactive viewer to look through the sample information more easily.
View(colData(se))
Questions
I’ve searched through the columns and picked out two columns that contain the information I want to use:
cancer.type.ch1normal.ch1To look at the breakdown of samples based on
both the tumor type and whether the sample was
derived from the tumor itself or non-tumor tissue nearby. The
table function can do that.
with(colData(se),table(`cancer.type.ch1`,`normal.ch1`))
## normal.ch1
## cancer.type.ch1 no yes
## BC 65 10
## CRC 57 12
## NSCLC 60 9
## PCA 60 7
Questions
Hint: Tumor types are in rows and
normal/tumor status is in columns.
Recall that a “heatmap” is a combination of three pieces to make a plot of gene expression data.
In some cases, this kind of plot will show a structure of blocks of color that may represent genes that are different between sets of samples. Without the clustering of samples and genes, such structure in the data will not be visible.
Before creating a heatmap, I am going to narrow down the data a bit to include the top 500 most variable genes. The rational for doing so is two-fold.
The following block of code uses the sd function on each
row to calculate the Standard Deviation (abbreviated “sd”). Then, we
select the 500 genes with the largest standard deviation
## ----sds,cache=TRUE,echo=TRUE-------------------------------------------------
sds = apply(assay(se, 'exprs'),1,sd)
dat = assay(se, 'exprs')[order(sds,decreasing = TRUE)[1:500],]
The variable dat is a matrix that contains
the resulting top 500 most variable genes for each sample. We can
double-check by looking at the dimensions of dat.
dim(dat)
## [1] 500 280
I also want to create a small dataset that contains the tumor type and the tumor/normal status. I’ll create that using the following code:
sampdat = data.frame(
type=factor(colData(se)[,'cancer.type.ch1']),
isnormal = factor(colData(se)[,'normal.ch1'])
)
We can take a quick look at these data:
head(sampdat)
## type isnormal
## 1 BC no
## 2 BC no
## 3 BC no
## 4 BC no
## 5 BC no
## 6 BC no
summary(sampdat)
## type isnormal
## BC :75 no :242
## CRC :69 yes: 38
## NSCLC:69
## PCA :67
We’ll be using the ComplexHeatmap
package to produce our heatmap.
library(ComplexHeatmap)
We can start with a basic heatmap of our top 500 most variable genes and all samples.
Heatmap(dat,show_row_names = FALSE, show_column_names = FALSE)
Questions
Let’s try to help ourselves a bit by adding some biological variables to the sample columns.
ha = HeatmapAnnotation(df=sampdat)
Heatmap(dat,top_annotation = ha,
show_column_names = FALSE,
show_row_names = FALSE)
Questions
Now that you have biological annotations for each sample, answer this question again:
For this section, we’ll again use the most variable genes that we
defined at the beginning of the heatmap section. Performing a principal
component analysis in R starts with a matrix as input. In this case,
that input matrix, dat contains gene expression data. But
it could be any matrix that you want to simplify by reducing
dimensionality.
The main function for doing principal components analysis in R is the
prcomp function. In the following block of code, I’m going
to go ahead and apply it to our dat matrix.
# rakk=10 limits calculation to only 10 principal components. This is only for convenience.
pcres = prcomp(dat,rank. = 10)
We can quickly look at a summary of the results:
summary(pcres)
## Importance of first k=10 (out of 280) components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 28.0114 15.5878 11.06765 9.17050 6.07968 5.35715 4.9017
## Proportion of Variance 0.5096 0.1578 0.07955 0.05462 0.02401 0.01864 0.0156
## Cumulative Proportion 0.5096 0.6674 0.74695 0.80157 0.82558 0.84422 0.8598
## PC8 PC9 PC10
## Standard deviation 3.79590 3.43745 3.1151
## Proportion of Variance 0.00936 0.00767 0.0063
## Cumulative Proportion 0.86918 0.87685 0.8832
This summary has a lot of useful information. In particular, the
Proportion of Variance is how much of the variation of our
entire dataset is captured by each PC. For example, if
we summarize our gene expression dataset with the first PC, we will have
captured more than 50% of the total variation. The first seven PCs each
capture more than 1% of the total variation. After that, the PCs capture
less than that. The top 10 PCs capture a total of
88.32% of the variation (Cumulative Proportion).
I’m going to manipulate the PCA results into a
data.frame so that we can plot. Just take my word for it
right now.
pcdf = data.frame(sampdat,pcres$rotation)
summary(pcdf)
## type isnormal PC1 PC2
## BC :75 no :242 Min. :0.02484 Min. :-0.139694
## CRC :69 yes: 38 1st Qu.:0.04870 1st Qu.:-0.011519
## NSCLC:69 Median :0.05936 Median : 0.015814
## PCA :67 Mean :0.05842 Mean :-0.008278
## 3rd Qu.:0.06865 3rd Qu.: 0.032701
## Max. :0.08138 Max. : 0.062973
## PC3 PC4 PC5
## Min. :-0.1240729 Min. :-0.127509 Min. :-0.198953
## 1st Qu.:-0.0440144 1st Qu.:-0.031096 1st Qu.:-0.020829
## Median : 0.0054298 Median : 0.005786 Median :-0.005820
## Mean :-0.0005879 Mean : 0.005708 Mean :-0.001857
## 3rd Qu.: 0.0306895 3rd Qu.: 0.056669 3rd Qu.: 0.014221
## Max. : 0.1143564 Max. : 0.117534 Max. : 0.175172
## PC6 PC7 PC8
## Min. :-0.137764 Min. :-0.212817 Min. :-0.1347058
## 1st Qu.:-0.046031 1st Qu.:-0.033061 1st Qu.:-0.0391158
## Median :-0.015361 Median : 0.001675 Median : 0.0019482
## Mean :-0.005747 Mean :-0.002336 Mean : 0.0001779
## 3rd Qu.: 0.018247 3rd Qu.: 0.033663 3rd Qu.: 0.0312022
## Max. : 0.191060 Max. : 0.181785 Max. : 0.1860653
## PC9 PC10
## Min. :-0.1705955 Min. :-0.183806
## 1st Qu.:-0.0357108 1st Qu.:-0.033834
## Median :-0.0006643 Median : 0.002368
## Mean :-0.0009023 Mean : 0.001586
## 3rd Qu.: 0.0338716 3rd Qu.: 0.038594
## Max. : 0.1492061 Max. : 0.161559
head(pcdf)
## type isnormal PC1 PC2 PC3 PC4
## GSM2772660 BC no 0.06170998 0.0118201202 -0.07277076 0.026097242
## GSM2772661 BC no 0.04823465 0.0143191336 -0.04327190 0.014975967
## GSM2772662 BC no 0.07530785 0.0140713629 -0.05830088 0.004662037
## GSM2772663 BC no 0.06470332 0.0009215445 -0.08254243 0.037084681
## GSM2772664 BC no 0.07562771 0.0243711057 -0.06538010 -0.006604001
## GSM2772665 BC no 0.06645876 0.0100085881 -0.05425063 0.024063192
## PC5 PC6 PC7 PC8 PC9
## GSM2772660 -0.044013499 0.06995323 -0.004717639 0.12908292 -0.070579681
## GSM2772661 -0.070268681 -0.03519182 -0.050319551 0.13667807 -0.006154981
## GSM2772662 -0.036916960 0.07085866 -0.007843776 0.12549398 -0.007211999
## GSM2772663 0.006212287 0.08922814 -0.038393757 0.04796887 -0.165708025
## GSM2772664 0.008372729 0.02561096 0.049302349 0.09439716 -0.063215452
## GSM2772665 -0.023214048 0.05424151 -0.022572000 0.09366333 -0.016355000
## PC10
## GSM2772660 0.01606562
## GSM2772661 0.06852335
## GSM2772662 -0.02794572
## GSM2772663 -0.13703616
## GSM2772664 -0.13337958
## GSM2772665 -0.00746967
With our data.frame in hand that includes the sample information
and the PC values for each sample, we can use
ggplot2 to make some plots
library(ggplot2)
ggplot(pcdf,aes(x=PC1,y=PC2)) + geom_point()
This plot isn’t very helpful since we cannot see the biological variables of interest. Let’s add them:
ggplot(pcdf,aes(x=PC1,y=PC2,color=type,shape=isnormal)) + geom_point()
Exercise
To look at multiple variables at once, we can use the
GGally package to create multiple PCs in the the same
plot.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(pcdf,columns=3:8,
mapping=aes(color=type,
alpha=0.5)) # alpha makes things a little transparent for easier viewing
The PCA part of things is really meant to give a taste of PC analysis and not to make us all PCA masters. When presented with large datasets, consider reading up and using PCA.